Model interpretation study
Using DALEX library
Create models
Code
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
preprocess = make_column_transformer(
(StandardScaler(), ['age', 'fare', 'parch', 'sibsp']),
(OneHotEncoder(), ['gender', 'class', 'embarked']))
Logistic regression model
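As a quick sanity check, the transformer can be fitted on a toy frame (the passenger values below are hypothetical; the study itself uses the full Titanic data):

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

preprocess = make_column_transformer(
    (StandardScaler(), ['age', 'fare', 'parch', 'sibsp']),
    (OneHotEncoder(), ['gender', 'class', 'embarked']))

# Hypothetical two-passenger frame, just to inspect the transformed width
toy = pd.DataFrame({'age': [8, 47], 'fare': [72, 25],
                    'parch': [0, 0], 'sibsp': [0, 1],
                    'gender': ['male', 'female'],
                    'class': ['1st', '3rd'],
                    'embarked': ['Southampton', 'Cherbourg']})
Xt = preprocess.fit_transform(toy)
# 4 scaled numeric columns + 6 one-hot columns (2 genders, 2 classes, 2 ports)
print(Xt.shape)
```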
Code
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('standardscaler',
                                                  StandardScaler(),
                                                  ['age', 'fare', 'parch', 'sibsp']),
                                                 ('onehotencoder',
                                                  OneHotEncoder(),
                                                  ['gender', 'class', 'embarked'])])),
                ('logisticregression', LogisticRegression())])
Random forest model
Code
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('standardscaler',
                                                  StandardScaler(),
                                                  ['age', 'fare', 'parch', 'sibsp']),
                                                 ('onehotencoder',
                                                  OneHotEncoder(),
                                                  ['gender', 'class', 'embarked'])])),
                ('randomforestclassifier',
                 RandomForestClassifier(max_depth=3, n_estimators=500))])
Gradient boosting model
Code
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('standardscaler',
                                                  StandardScaler(),
                                                  ['age', 'fare', 'parch', 'sibsp']),
                                                 ('onehotencoder',
                                                  OneHotEncoder(),
                                                  ['gender', 'class', 'embarked'])])),
                ('gradientboostingclassifier', GradientBoostingClassifier())])
Support vector machine model
Code
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('standardscaler',
                                                  StandardScaler(),
                                                  ['age', 'fare', 'parch', 'sibsp']),
                                                 ('onehotencoder',
                                                  OneHotEncoder(),
                                                  ['gender', 'class', 'embarked'])])),
                ('svc', SVC(probability=True))])
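The four pipelines printed above can be rebuilt in a few lines with make_pipeline (a sketch; the variable names titanic_lr, titanic_rf, titanic_gbc, and titanic_svm are my assumptions):

```python
from sklearn.compose import make_column_transformer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

preprocess = make_column_transformer(
    (StandardScaler(), ['age', 'fare', 'parch', 'sibsp']),
    (OneHotEncoder(), ['gender', 'class', 'embarked']))

# One pipeline per model family, matching the printed reprs above
titanic_lr = make_pipeline(preprocess, LogisticRegression())
titanic_rf = make_pipeline(preprocess,
                           RandomForestClassifier(max_depth=3, n_estimators=500))
titanic_gbc = make_pipeline(preprocess, GradientBoostingClassifier())
titanic_svm = make_pipeline(preprocess, SVC(probability=True))
```

Note that probability=True is required on the SVC so that it exposes predict_proba, which the explainers rely on.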
Models’ predictions
Code
import pandas as pd
johnny_d = pd.DataFrame({'gender': ['male'],
'age' : [8],
'class' : ['1st'],
'embarked': ['Southampton'],
'fare' : [72],
'sibsp' : [0],
'parch' : [0]},
index = ['JohnnyD'])
henry = pd.DataFrame({'gender' : ['male'],
'age' : [47],
'class' : ['1st'],
'embarked': ['Cherbourg'],
'fare' : [25],
'sibsp' : [0],
'parch' : [0]},
index = ['Henry'])
Instance Level
Break-down plots for additive attributions
Break-down plots answer the question: which variables contribute most to this result?
Code
Preparation of a new explainer is initiated
-> data : 2207 rows 7 cols
-> target variable : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
-> target variable : 2207 values
-> model_class : sklearn.ensemble._forest.RandomForestClassifier (default)
-> label : Titanic RF Pipeline
-> predict function : <function yhat_proba_default at 0x7fd2d838a8b0> will be used (default)
-> predict function : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
-> predicted values : min = 0.174, mean = 0.322, max = 0.889
-> model type : classification will be used (default)
-> residual function : difference between y and yhat (default)
-> residuals : min = -0.829, mean = -5.07e-05, max = 0.824
-> model_info : package sklearn
A new explainer has been created!
| | variable_name | variable_value | variable | cumulative | contribution | sign | position | label |
|---|---|---|---|---|---|---|---|---|
| 0 | intercept | 1 | intercept | 0.322207 | 0.322207 | 1.0 | 8 | Titanic RF Pipeline |
| 1 | class | 1st | class = 1st | 0.394536 | 0.072328 | 1.0 | 7 | Titanic RF Pipeline |
| 2 | embarked | Cherbourg | embarked = Cherbourg | 0.423858 | 0.029323 | 1.0 | 6 | Titanic RF Pipeline |
| 3 | fare | 25.0 | fare = 25.0 | 0.429974 | 0.006116 | 1.0 | 5 | Titanic RF Pipeline |
| 4 | sibsp | 0.0 | sibsp = 0.0 | 0.428781 | -0.001193 | -1.0 | 4 | Titanic RF Pipeline |
| 5 | parch | 0.0 | parch = 0.0 | 0.423522 | -0.005259 | -1.0 | 3 | Titanic RF Pipeline |
| 6 | age | 47.0 | age = 47.0 | 0.416755 | -0.006766 | -1.0 | 2 | Titanic RF Pipeline |
| 7 | gender | male | gender = male | 0.308259 | -0.108496 | -1.0 | 1 | Titanic RF Pipeline |
| 8 | prediction | | prediction | 0.308259 | 0.308259 | 1.0 | 0 | Titanic RF Pipeline |
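In dalex, a table like this comes from the explainer's predict_parts() method. The underlying idea can be sketched directly (hypothetical helper and toy data, not the library's implementation): fix the observation's values in the background data one variable at a time, in a chosen order, and record how the mean prediction moves.

```python
import numpy as np
import pandas as pd

def break_down(predict, data, instance, order):
    # Start from the mean prediction over the data (the "intercept" row)
    current = data.copy()
    cumulative = [predict(current).mean()]
    for var in order:
        # Condition on this variable: every row gets the instance's value
        current[var] = instance[var].iloc[0]
        cumulative.append(predict(current).mean())
    return pd.DataFrame({'variable': order,
                         'contribution': np.diff(cumulative)})

# Toy additive model: each contribution is the distance from the variable's mean
rng = np.random.default_rng(0)
data = pd.DataFrame({'a': rng.normal(size=200), 'b': rng.normal(size=200)})
inst = pd.DataFrame({'a': [2.0], 'b': [-1.0]})
bd = break_down(lambda d: 3 * d['a'] + d['b'], data, inst, ['a', 'b'])
```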
Break-down plots for additive interactions
Interaction (deviation from additivity) means that the effect of an explanatory variable depends on the value(s) of other variable(s).
Code
| | variable_name | variable_value | variable | cumulative | contribution | sign | position | label |
|---|---|---|---|---|---|---|---|---|
| 0 | intercept | 1 | intercept | 0.322207 | 0.322207 | 1.0 | 5 | Titanic RF Pipeline |
| 1 | class:gender | 1st:male | class:gender = 1st:male | 0.295512 | -0.026696 | -1.0 | 4 | Titanic RF Pipeline |
| 2 | fare:embarked | 25.0:Cherbourg | fare:embarked = 25.0:Cherbourg | 0.328518 | 0.033006 | 1.0 | 3 | Titanic RF Pipeline |
| 3 | parch:sibsp | 0.0:0.0 | parch:sibsp = 0.0:0.0 | 0.318294 | -0.010224 | -1.0 | 2 | Titanic RF Pipeline |
| 4 | age | 47.0 | age = 47.0 | 0.308259 | -0.010035 | -1.0 | 1 | Titanic RF Pipeline |
| 5 | prediction | | prediction | 0.308259 | 0.308259 | 1.0 | 0 | Titanic RF Pipeline |
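The paired rows above come from checking whether fixing two variables together moves the mean prediction more than the sum of fixing each alone. A toy sketch of that deviation-from-additivity check (hypothetical helper, not the dalex implementation):

```python
import numpy as np
import pandas as pd

def interaction_delta(predict, data, instance, i, j):
    base = predict(data).mean()
    def effect(cols):
        # Mean-prediction shift after fixing these columns at the instance's values
        d = data.copy()
        for c in cols:
            d[c] = instance[c].iloc[0]
        return predict(d).mean() - base
    # Joint effect minus the two individual effects
    return effect([i, j]) - effect([i]) - effect([j])

rng = np.random.default_rng(1)
data = pd.DataFrame({'a': rng.normal(size=200), 'b': rng.normal(size=200)})
inst = pd.DataFrame({'a': [2.0], 'b': [3.0]})
# Purely multiplicative model: the pair interacts, so delta is far from zero
delta = interaction_delta(lambda d: d['a'] * d['b'], data, inst, 'a', 'b')
```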
Shapley Additive exPlanations (SHAP) for average attributions
SHAP averages a variable's attributions over many orderings of the variables, removing the influence of the ordering on the result.
| | variable | contribution | variable_name | variable_value | sign | label | B |
|---|---|---|---|---|---|---|---|
| 0 | embarked = Cherbourg | 0.021287 | embarked | Cherbourg | 1.0 | Titanic RF Pipeline | 1 |
| 1 | sibsp = 0.0 | 0.000773 | sibsp | 0 | 1.0 | Titanic RF Pipeline | 1 |
| 2 | gender = male | -0.091533 | gender | male | -1.0 | Titanic RF Pipeline | 1 |
| 3 | age = 47.0 | -0.005456 | age | 47 | -1.0 | Titanic RF Pipeline | 1 |
| 4 | parch = 0.0 | -0.006794 | parch | 0 | -1.0 | Titanic RF Pipeline | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2 | embarked = Cherbourg | 0.022800 | embarked | Cherbourg | 1.0 | Titanic RF Pipeline | 0 |
| 3 | parch = 0.0 | -0.005639 | parch | 0 | -1.0 | Titanic RF Pipeline | 0 |
| 4 | age = 47.0 | -0.005172 | age | 47 | -1.0 | Titanic RF Pipeline | 0 |
| 5 | fare = 25.0 | 0.004127 | fare | 25 | 1.0 | Titanic RF Pipeline | 0 |
| 6 | sibsp = 0.0 | -0.001006 | sibsp | 0 | -1.0 | Titanic RF Pipeline | 0 |
182 rows × 7 columns
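For a handful of variables the Shapley average can even be taken over all orderings exactly, by repeating the break-down computation per ordering (toy sketch with a hypothetical helper; dalex samples B random orderings instead):

```python
from itertools import permutations

import numpy as np
import pandas as pd

def exact_shap(predict, data, instance):
    cols = list(data.columns)
    totals = dict.fromkeys(cols, 0.0)
    orders = list(permutations(cols))
    for order in orders:
        current = data.copy()
        prev = predict(current).mean()
        for var in order:
            current[var] = instance[var].iloc[0]
            now = predict(current).mean()
            totals[var] += now - prev   # this ordering's contribution
            prev = now
    return {c: totals[c] / len(orders) for c in cols}

rng = np.random.default_rng(2)
data = pd.DataFrame({'a': rng.normal(size=100), 'b': rng.normal(size=100)})
inst = pd.DataFrame({'a': [2.0], 'b': [1.0]})
# Additive model: the ordering does not matter, so SHAP equals the break-down
phi = exact_shap(lambda d: d['a'] + 2 * d['b'], data, inst)
```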
Local Interpretable Model-agnostic Explanations (LIME)
Break-down (BD) plots and Shapley values are most suitable for models with a small or moderate number of explanatory variables. For models with many variables, sparse explanations are preferable; the most popular example of such sparse explainers is the Local Interpretable Model-agnostic Explanations (LIME) method and its modifications.
In the first step, we read the Titanic data and encode categorical variables. In this case, we use the simplest encoding for gender, class, and embarked, i.e., the label-encoding.
Code
import dalex as dx
titanic = dx.datasets.load_titanic()
X = titanic.drop(columns='survived')
y = titanic.survived
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
X['gender'] = le.fit_transform(X['gender'])
X['class'] = le.fit_transform(X['class'])
X['embarked'] = le.fit_transform(X['embarked'])
In the next step we train a random forest model.
Code
RandomForestClassifier()
It is time to define the observation for which the model prediction will be explained. We write Henry’s data into a pandas.Series object.
The lime library explains models that operate on images, text, or tabular data. In the latter case, we have to use the LimeTabularExplainer module.
The result is an explainer that can be used to interpret a model around specific observations. In the following example, we explain the behaviour of the model for Henry. The explain_instance() method finds a local approximation with an interpretable linear model. The result can be presented graphically with the show_in_notebook() method.
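The idea behind explain_instance() can be sketched without the lime package: sample perturbations around the observation, weight them by proximity, and fit a weighted linear surrogate. The helper below is hypothetical and deliberately minimal; the real library additionally discretizes features and selects a sparse subset of them.

```python
import numpy as np
from sklearn.linear_model import Ridge

def lime_sketch(predict, x, n_samples=20000, kernel_width=1.0, seed=0):
    rng = np.random.default_rng(seed)
    # Gaussian perturbations around the explained point
    Z = x + rng.normal(size=(n_samples, x.size))
    # Closer samples get larger weight (RBF proximity kernel)
    w = np.exp(-((Z - x) ** 2).sum(axis=1) / kernel_width ** 2)
    surrogate = Ridge(alpha=1.0).fit(Z, predict(Z), sample_weight=w)
    return surrogate.coef_

# For a model that is already linear, the surrogate recovers its slopes
coef = lime_sketch(lambda Z: Z @ np.array([3.0, -1.0]), np.array([0.5, 1.0]))
```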
Ceteris-paribus profiles
“Ceteris paribus” is a Latin phrase meaning “other things held constant” or “all else unchanged”.
Ceteris-paribus (CP) profiles show how a model’s prediction would change if the value of a single explanatory variable changed.
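Conceptually a CP profile is cheap to compute: replicate the observation along a grid of one variable and predict every copy (hypothetical helper and toy model; dalex does this via the explainer's predict_profile() method):

```python
import numpy as np
import pandas as pd

def ceteris_paribus(predict, instance, variable, grid):
    # Copies of the observation, differing only in the profiled variable
    profile = pd.concat([instance] * len(grid), ignore_index=True)
    profile[variable] = grid
    profile['_yhat_'] = predict(profile)
    return profile

inst = pd.DataFrame({'a': [1.0], 'b': [2.0]})
cp = ceteris_paribus(lambda d: d['a'] + d['b'], inst, 'a',
                     np.linspace(0.0, 4.0, 5))
# b stays fixed at 2, so the profile is the line a + 2 over the grid
```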
Code
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('standardscaler',
                                                  StandardScaler(),
                                                  ['age', 'fare', 'parch', 'sibsp']),
                                                 ('onehotencoder',
                                                  OneHotEncoder(),
                                                  ['gender', 'class', 'embarked'])])),
                ('randomforestclassifier',
                 RandomForestClassifier(max_depth=3, n_estimators=500))])
Code
Preparation of a new explainer is initiated
-> data : 2207 rows 7 cols
-> target variable : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
-> target variable : 2207 values
-> model_class : sklearn.ensemble._forest.RandomForestClassifier (default)
-> label : Titanic RF Pipeline
-> predict function : <function yhat_proba_default at 0x7fd2d838a8b0> will be used (default)
-> predict function : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
-> predicted values : min = 0.17, mean = 0.321, max = 0.901
-> model type : classification will be used (default)
-> residual function : difference between y and yhat (default)
-> residuals : min = -0.834, mean = 0.0008, max = 0.827
-> model_info : package sklearn
A new explainer has been created!
| | gender | age | class | embarked | fare | sibsp | parch | _original_ | _yhat_ | _vname_ | _ids_ | _label_ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Henry | male | 47.000000 | 1st | Cherbourg | 25.0 | 0.0 | 0.00 | male | 0.300351 | gender | Henry | Titanic RF Pipeline |
| Henry | female | 47.000000 | 1st | Cherbourg | 25.0 | 0.0 | 0.00 | male | 0.822174 | gender | Henry | Titanic RF Pipeline |
| Henry | male | 0.166667 | 1st | Cherbourg | 25.0 | 0.0 | 0.00 | 47 | 0.420320 | age | Henry | Titanic RF Pipeline |
| Henry | male | 0.905000 | 1st | Cherbourg | 25.0 | 0.0 | 0.00 | 47 | 0.420320 | age | Henry | Titanic RF Pipeline |
| Henry | male | 1.643333 | 1st | Cherbourg | 25.0 | 0.0 | 0.00 | 47 | 0.417190 | age | Henry | Titanic RF Pipeline |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Henry | male | 47.000000 | 1st | Cherbourg | 25.0 | 0.0 | 8.64 | 0 | 0.342079 | parch | Henry | Titanic RF Pipeline |
| Henry | male | 47.000000 | 1st | Cherbourg | 25.0 | 0.0 | 8.73 | 0 | 0.342079 | parch | Henry | Titanic RF Pipeline |
| Henry | male | 47.000000 | 1st | Cherbourg | 25.0 | 0.0 | 8.82 | 0 | 0.342079 | parch | Henry | Titanic RF Pipeline |
| Henry | male | 47.000000 | 1st | Cherbourg | 25.0 | 0.0 | 8.91 | 0 | 0.342079 | parch | Henry | Titanic RF Pipeline |
| Henry | male | 47.000000 | 1st | Cherbourg | 25.0 | 0.0 | 9.00 | 0 | 0.342079 | parch | Henry | Titanic RF Pipeline |
419 rows × 12 columns
Code
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('standardscaler',
                                                  StandardScaler(),
                                                  ['age', 'fare', 'parch', 'sibsp']),
                                                 ('onehotencoder',
                                                  OneHotEncoder(),
                                                  ['gender', 'class', 'embarked'])])),
                ('logisticregression', LogisticRegression())])
Code
Preparation of a new explainer is initiated
-> data : 2207 rows 7 cols
-> target variable : Parameter 'y' was a pandas.Series. Converted to a numpy.ndarray.
-> target variable : 2207 values
-> model_class : sklearn.linear_model._logistic.LogisticRegression (default)
-> label : Titanic RL Pipeline
-> predict function : <function yhat_proba_default at 0x7fd2d838a8b0> will be used (default)
-> predict function : Accepts only pandas.DataFrame, numpy.ndarray causes problems.
-> predicted values : min = 0.009, mean = 0.322, max = 0.97
-> model type : classification will be used (default)
-> residual function : difference between y and yhat (default)
-> residuals : min = -0.96, mean = -5.83e-07, max = 0.964
-> model_info : package sklearn
A new explainer has been created!
Dataset Level
Model-performance Measures
Code
Code
| | recall | precision | f1 | accuracy | auc |
|---|---|---|---|---|---|
| Titanic RF Pipeline | 0.500703 | 0.765591 | 0.605442 | 0.78976 | 0.804917 |
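These numbers come from thresholding the predicted survival probabilities at 0.5; the same measures can be reproduced with sklearn directly (hypothetical toy vectors, just to show the definitions):

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

y_true = [0, 0, 1, 1, 1, 0]
y_prob = [0.2, 0.6, 0.7, 0.4, 0.9, 0.1]   # hypothetical predicted probabilities
y_pred = [int(p >= 0.5) for p in y_prob]  # default 0.5 cut-off

scores = {'recall': recall_score(y_true, y_pred),
          'precision': precision_score(y_true, y_pred),
          'f1': f1_score(y_true, y_pred),
          'accuracy': accuracy_score(y_true, y_pred),
          'auc': roc_auc_score(y_true, y_prob)}  # AUC uses the raw probabilities
```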
Code
import plotly.express as px
from sklearn.metrics import roc_curve, auc
y_score = titanic_rf_exp.predict(X)
fpr, tpr, thresholds = roc_curve(y, y_score)
fig = px.area(x=fpr, y=tpr,
title=f'ROC Curve (AUC={auc(fpr, tpr):.4f})',
labels=dict(x='False Positive Rate', y='True Positive Rate'),
width=700, height=500)
fig.add_shape(
type='line', line=dict(dash='dash'),
x0=0, x1=1, y0=0, y1=1)
fig.update_yaxes(scaleanchor="x", scaleratio=1)
fig.update_xaxes(constrain='domain')
fig.show()
Code
df = pd.DataFrame({'False Positive Rate': fpr,
'True Positive Rate': tpr }, index=thresholds)
df.index.name = "Thresholds"
df.columns.name = "Rate"
fig_thresh = px.line(df,
title='TPR and FPR at every threshold', width=700, height=500)
fig_thresh.update_yaxes(scaleanchor="x", scaleratio=1)
fig_thresh.update_xaxes(range=[0, 1], constrain='domain')
fig_thresh.show()
Variable-importance Measures
Variable importance can be assessed in a model-specific or a model-agnostic way; the permutation-based measure used below is model-agnostic.
Code
| | variable | dropout_loss | label |
|---|---|---|---|
| 0 | _full_model_ | 0.194832 | Titanic RF Pipeline |
| 1 | sibsp | 0.197510 | Titanic RF Pipeline |
| 2 | parch | 0.198713 | Titanic RF Pipeline |
| 3 | embarked | 0.198997 | Titanic RF Pipeline |
| 4 | fare | 0.203987 | Titanic RF Pipeline |
| 5 | age | 0.208972 | Titanic RF Pipeline |
| 6 | class | 0.263713 | Titanic RF Pipeline |
| 7 | gender | 0.357970 | Titanic RF Pipeline |
| 8 | _baseline_ | 0.489277 | Titanic RF Pipeline |
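The dropout_loss column reports the loss after permuting one variable (note that _full_model_, 0.1948, matches 1 − AUC from the performance table above). The mechanic can be sketched with a toy regression (hypothetical helper; the dalex call is the explainer's model_parts() method):

```python
import numpy as np
import pandas as pd

def dropout_loss(predict, X, y, variable, loss, seed=0):
    # Permute one column, breaking its relation with the target
    Xp = X.copy()
    Xp[variable] = np.random.default_rng(seed).permutation(Xp[variable].values)
    return loss(y, predict(Xp))

mse = lambda y, yhat: float(np.mean((np.asarray(y) - np.asarray(yhat)) ** 2))
X = pd.DataFrame({'a': np.arange(100.0), 'b': np.zeros(100)})
y = X['a']
model = lambda d: d['a']          # the model ignores b entirely
loss_a = dropout_loss(model, X, y, 'a', mse)
loss_b = dropout_loss(model, X, y, 'b', mse)
# Permuting the informative variable hurts; permuting the ignored one cannot
```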
Code
| | variable | dropout_loss | label |
|---|---|---|---|
| 0 | _full_model_ | 0.199035 | Titanic RF Pipeline |
| 1 | wealth | 0.267039 | Titanic RF Pipeline |
| 2 | personal | 0.390928 | Titanic RF Pipeline |
| 3 | _baseline_ | 0.508033 | Titanic RF Pipeline |
Partial-dependence profiles
Code
| | _vname_ | _label_ | _x_ | _yhat_ | _ids_ |
|---|---|---|---|---|---|
| 0 | age | Titanic RF Pipeline | 0.166667 | 0.414360 | 0 |
| 1 | age | Titanic RF Pipeline | 0.905000 | 0.416245 | 0 |
| 2 | age | Titanic RF Pipeline | 1.643333 | 0.412445 | 0 |
| 3 | age | Titanic RF Pipeline | 2.381667 | 0.412445 | 0 |
| 4 | age | Titanic RF Pipeline | 3.120000 | 0.413758 | 0 |
| ... | ... | ... | ... | ... | ... |
| 197 | fare | Titanic RF Pipeline | 491.578272 | 0.364379 | 0 |
| 198 | fare | Titanic RF Pipeline | 496.698879 | 0.364379 | 0 |
| 199 | fare | Titanic RF Pipeline | 501.819486 | 0.364379 | 0 |
| 200 | fare | Titanic RF Pipeline | 506.940093 | 0.364379 | 0 |
| 201 | fare | Titanic RF Pipeline | 512.060700 | 0.364379 | 0 |
202 rows × 5 columns
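A partial-dependence profile is simply the average of ceteris-paribus profiles: fix the variable at each grid value for the whole dataset and average the predictions (hypothetical helper and toy model; dalex computes this via the explainer's model_profile() method):

```python
import pandas as pd

def partial_dependence(predict, data, variable, grid):
    # For each grid value: every row gets that value, then average the predictions
    return [float(predict(data.assign(**{variable: v})).mean()) for v in grid]

data = pd.DataFrame({'a': [0.0, 1.0, 2.0, 3.0], 'b': [1.0, 1.0, 3.0, 3.0]})
pdp = partial_dependence(lambda d: d['a'] + d['b'], data, 'a', [0.0, 5.0])
# For this additive toy model, PD at a=v is exactly v + mean(b) = v + 2
```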